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Abstract 

This paper describes how Minimum Description Length 
(MDL) can be applied to the problem of DNA and pro- 
tein evolutionary tree reconstruction. If there is a set of 
mutations that transform a common ancestor into a set 
of the known sequences, and this description is shorter 
than the information to encode the known sequences di- 
rectly, then strong evidence for an evolutionary relation- 
ship has been found. We describe a heuristic algorithm 
that searches for the simplest tree (smallest MDL) that 
finds close to optimal trees on our test data. Various 
ways of extending the MDL theory to more complex evo- 
lutionary relationships are discussed. 

Introduction 

A major challenge for the 90*8 will be making sense 
of the flood of data generated by the Human Genome 
project. If we know the evolutionary history of genes, we 
can often make strong predictions about their function, 
by comparison with other more distantly related genes 
whose function we do know. Such sequence comparisons 
are routinely performed for newly sequenced genes ua- 
ing existing gene databases (e.g. GenBank), and many 
interesting, unexpected relationships have been discov- 
ered. In addition to using evolutionary relationships 
to predict the function of genes, evolutionary tree re- 
construction is of interest in its own right, as it often 
shows relationships between species that cannot be re- 
constructed from the fossil record alone. Also, evolu- 
tionary information provides useful information about 
protein 3-D structure, because proteins whose structure 
is known can give structural information about related 
proteins. We present a MDL approach to evolutionary 
tree reconstruction below. 

Basic Theory 

The problem addressed in this paper is: “Given a set 
of known DNA sequences, find the most probable evolu- 
tionary tree (or trees) that relates these sequences*. A 
simple example of an evolutionary tree is shown in Fig. 1, 
where the known sequences are shown on the bottom 
row, and all other sequences are hypothetical reconstruc- 


tions of ancestors. The length of a branch from parent to 
child sequence is proportional to time since divergence. 
The mutations that change a parent into the correspond- 
ing child sequence are shown on the connecting branches. 
A tree, such as Fig. 1, is a possible evolutionary model 



that “explains* how the known sequences (on the bottom 
row) were created. However, there are many possible 
trees, all having the same terminal sequences; so which 
tree (or set of trees) should be preferred over another? 
Intuitively, the “simplest* tree seems the most plausible, 
and this intuition has lead to the ‘parsimony* approach 
to evolutionary tree reconstruction [2]. The usual defini- 
tion of parsiminoy is unable to account for more complex 
evolutionary models, such as unequal rates of mutation 
on parallel branches, and so in some cases asymptotically 
approaches the wrong answer as more data is provided 
[2], The MDL approach described below captures the 
basic intuition behind the parsimony approach, but de- 
fines ‘parsimony* in terms of probabilities. 

Although Fig. 1 relates all the known sequences 
through a single tree, we consider the general problem 
to include reconstructing multiple trees that jointly “ex- 
plain* the sequence data. That is, we are not assuming 
that all sequences had a common ancestor at some time 
in the past, although our example will be developed un- 
der this assumption. For a set of related proteins, a 
graph (instead of a tree) is sometimes the correct repre- 
sentation of the evolutionary events 1 . 

1 Sometime* * protein evolve* from * combin*tion of part* of 


MDL Formulation of Problem 

By Bayes Theorem, the relative posterior probability ra- 
tio of two different trees T, and 2y given a set of known 
sequences 5 is: 

p(7i|S) _ p (r,)p(5irQ ^ 

R5TI5) ~p(TMSMY 

By takixig logarithms of this equation and negating we 

— log p(Ti 1 5) - (- log p(Ty 1 5) = 

- log p{Ti) - (- logp(Ty)) 

— logp(5|T<) - (-logp(5|Ty) (2) 

From information theory, -logp< is the minimum pos- 
sible message length to encode the tth outcome if this 
outcome had probability p, . If the base of the logarithm 
is 2, then the message length is in bits. It is clear from 
Eqn. (2) that the particular tree T) with the maximum 
posterior probability relative to any other tree Ty is also 
the tree with the shortest relative encoding (MDL). For 
each tree Ti, the MDL consists of two parts: a part to 
describe the model selected (Ti) and a part to describe 
the data given the model. To apply Eqn. (2) we need 
the prior probabilities of trees, p(T*), and the likelihood 
function, p(5|T*). Note in the following, a tree T refers 
to the hypothetical construct only and does not include 
the known sequences 5 Also, we only calculate the length 
of the theoretical minimum message; we do not actually 
do the encoding. 

Tree Prior Probabilities 

If the user has prior information about the target evo- 
lutionary tree, such as from fossil evidence, then this 
can be used directly in Eqn. (2), but this information is 
rarely available. A common evolutionary assumption is 
that the probability of a mutational event or branching 
event occurring per unit time is independent of the abso- 
lute time. These independence assumptions imply that 
the probability of a sub-tree is independent of the events 
in the super tree and only depends on the immediate 
parent and the time since divergence from the parent. 
Symbolically, these independences can be expressed as: 

p(Tree) = p(Root) 

p(#branches)p(Subtree! |Root) • • p(Subtree„ jRoot) 

where the probabilities p (Subtree; | Root) are recursively 
defined by a similar decomposition in which a subtree 

cxistinc protsin* (domain*), and »o can have multiple parents. 


only depends on its immediate parent. This recursion 
stops when a subtree has only known terminal sequences 
for children. Taking Logarithms of this equation, includ- 
ing the recursive expansion of the subtrees, leads to a 
simple additive form for -logp(7<). This additive form 
corresponds to a recursive coding scheme, where all im- 
mediate children are described as the result of a partic- 
ular set of mutations of the parent. Since the root does 
not have a parent, it must be described separately. The 
leaf nodes are not regarded as part of the tree T*. How- 
ever, the definition of the conditional probability of a 
child given its parent is the same everywhere in the tree, 
including the leaf nodes. Because of the information the- 
oretic interpretation of Bayes theorem, we believe that 
choosing a coding scheme is equivalent to accepting par- 
ticular prior probabilities (and wee versa) . 

Sequence Probabilities 

Except for the root, which is coded directly, the infor- 
mation required to describe all other nodes in the tree is 
reduced by describing each child given its parent. This 
reduced encoding uses the conditional probability of a 
child sequence given (a) its parent, (b) a set of muta- 
tions that transform the parent into the child and (c) a 
time difference between parent and child, i.e. 

P( 5 child. ^parent- ™ utioM - time difference). (3) 

This requires finding the most probable set of mutational 
events that could have transformed the parent into the 
child, or at least a very probable set. For example, in 
Fig. 1, two alternative sets of mutation between PI and 
54 are given — many more are possible. We note that 
the tree building procedure described by in (2) finds the 
maximum likelihood tree topology and branch lengths. 
It does not infer particular ancestral sequences, but av- 
erages over all possible ancestral sequences. This ap- 
proach is answering a different question than the one 
addressed here, and it has difficulty taking into account 
insertions and deletions. Methods we use for finding the 
most probable mutation events and time differences are 
described below, but here we assume they are known. 
There are three types of time dependent mutations that 
transform the parent sequence into the corresponding 
child sequence; point mutations, insertions and dele- 
tions. Our coding scheme for a sequence transformation 
is as follows. 

1. Deletions. At each letter 2 in the parent string, 
we state whether it is the beginning of a deleted 

aw. will us* th« term 'letter* to r.f.r to .ith.r nucleic »cid* 
or or amino acids, as appropriate. 



string or not. Thu., if there are 300 letter, in the 
parent, we encode up to 300 Yet or No messages. 
Since deletion, are rare, each No menage is typi- 
cally a .mall fraction of a bit; it. length u given by 
-logll - p«(t)l, while a longer Yet me»age ha. 
length -logpLio biU. The Yet and iVo menage, 
resume from the next letter that was not deleted. 
Whenever a deletion event occur., the length of 
the deletion (beginning at the cunent letter) i» *• 
scribed, u.ing - logp(n) bit., where ft is the length 
of the deletion. The t parameter is the number of 
time unit, between parent and child. Note that we 
are assuming that deletion length probabilities are 


independent of tune. 

2. Point mutation.. For each letter we give a mes- 
sage de^ribing the fate of that letter of length 
— log p(new | old), where we use the fact that the 
probability that a particular base or residue will 
turn into another depend, on what it was before. 
For example, a purine base is about twice a. likely 
to turn into another purine (i.e. A — G) than it 
is to a pyramidine (e.g. A — ► (T or C)). An ex- 
ample change probability matrix is shown m Fig. 2. 
This coding scheme make, the simplifying assump- 
tion that the probability of a point mutation doe. 
not depend on iU neighboring base* or residue, or 
iU location in the sequence. Both these assumptions 
are biologically incorrect. For example, a C followed 
byaGil much more likely to be transformed into 
a T than a C followed by A,C or T (CpG decay). 
Similarly, in coding DNA, silent third position baw. 
have a much higher point mutation rate than non- 
redundant position*. 

3 Insertions. These are encoded much the same way 
as we encode deletion*. If the parent is 300 letter, 
long, we encode exactly 301 Yet and No me«ages. 
This time, in addition to it. position and length, 
we must describe the letters in each inserted string. 
Each inserted letter ia described in - log Pi bits. 


The encoding method described above is similar to 
describing a set of ‘edit commands* that transform the 
parent into the child sequence. However, instead of com- 
mands, such as: ‘skip the next 20 letters* , we prefer to 
use the probabilities of different mutations at each posi- 
tion in the sequence. The use of biologically meaningful 
mutation probabilities allows learning to take place ur 
ing tree reconstruction. The above coding procedure al- 
lows the message length (- log p(T,)) of a tree to be cal- 
culated provided all the probability, mentioned above 
are known. The known sequences at the leaf nodes are 
not encoded in the tree description, since they are un- 
plied by it. 


Time-Dependent Probabilities 

For a particular letter, the probability that it will un- 
dergo a point mutation is a function of the tune be- 
..*« th. pmmri »<1 >b« drild. For • 

‘time unit.*, these point mutation probabilities can be 
deduced from the unit-time transition matrix. An ex- 
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Figure 2: Point Mutation Probability Matrix 

To obtain the point mutation probability for n time 
unit., this matrix is raised to the power n. The uni 
time matrix i. dm*. » that dm probabdlfl, of a immt 
imitation in one time nnit i. .01. Even after 25 tune 
unit., the probability of a base remaining unaltered is 
778, so that multiple mutation event* at the .ame lo- 
cation are unlikely even for this large time difference. 
Note that we have quantised time, «. that the mfoima- 
tion required to de^ribe the time between a child and it. 
parent is small (-logp(t)). The me«age length clearly 
depend, on the quantisation level selected, so for a given 
tTtt there i* an optimal quantisation level. For 8imp c 
'L w, Lnm. dm. .bo .bn. b> -bid. 1 PAM (1% pou>. 
mnution rri.) o.,nr. i. «d«pm... An improved vomon 
would dynamically optimise this parameter during tree 


Probability of Known Sequences 

The probability of all the known sequences, p(S\T % ) in 
equation (l), i* given by: 

p(5|T<) = JJp(5/| Immediate parent of 5,); (4) 


vhere 5.- is a particular known sequence. The individual 
probabilities in Eqn. (4) are coded the same way as any 


Dynamic Probability Learning 

If all the component probabilities required above are 
known from past experience, then these can e use i- 
rectly, with no need for learning. However past ■ expen- 
ence usuaUy only provides rough prior probabilities that 



can be used as initial values. As the tree is built, further 
information becomes available from the frequencies of 
types of mutation events in the current tree. To exploit 
this additional information, we compute our tree MDL 
serially, and update the probabilities for the rest of the 
tree, using information from the tree encoded so far. We 
use the standard Bayesian probability update formula, 
illustrated here for the probability of a deletion: 


_ njei + rjt i 

~ N + R 


( 5 ) 


where n^ei i* the observed number of deletions encoun- 
tered so far, N is the total number of letters (deletion 
or not), r<ui is a prior weight for deletions and R is the 
total prior weight. The situation for updating point mu- 
tation probabilities is not as simple, because the muta- 
tions typically occur after a number of time steps, while 
the transition probability matrix is defined for a single 
time step. We solve this problem by arbitrarily assigning 
a given point mutation to a single time step and all the 
other time steps count as no mutation. This approxima- 
tion depends on the probability of alternative multiple 
mutations with the same end result being very unlikely. 

We have now described how to compute the compo- 
nent probabilities in Eqn. (2), so that relative MDL, 
can be found. We use this relative MDL to search for 
the lowest MDL tree, as described below. 


Sequence Alignment 

The problem of finding the most probable mutation list, 
given a parent and a child sequence, is the standard pair- 
wise sequence alignment problem. An alignment algo- 
rithm typically uses a “penalty function that assigns a 
penalty to any proposed mutational events— the goal is 
to find the alignment with the minimum penalty. In the 
MDL approach, these penalties turn out to be mutation 
probabilities in disguise. There are many alignment algo- 
rithms in use. Generally, their performance depends on 
how well the user adjusts the penalties. We have imple- 
mented an MDL based alignment algorithm in LISP that 
uses given probabilities. This allows it to use knowledge 
of the estimated time difference between a parent and 
child, and to adjust itself to the dynamically changing 
posterior probabilities for different types of mutations. 
It is very similar in theory and practice to the indepen- 
dently conceived procedure described by [l]. 

The Tree Building Procedure 

The previous sections describe a MDL measure for decid- 
ing which of two alternative trees is more probable. This 
suggests an iterative improvement procedure for finding 


an optimal tree-start with a tree that is approximately 
right, then look for local improvements of the MDL mea- 
sure. The obvious way of constructing a good initial tree 
is to build an initial (n x n) “distance matrix* based on 
the MDL alignment values, then build the tree bottom- 
up. Unfortunately, the cost of constructing the distance 
matrix can be prohibitive — especially as the alignment 
algorithm will spend a lot of time producing alignments 
of very distantly related sequences that never get used 
by the tree building procedure. To reduce this cost, we 
have implemented a heuristic initial tree building proce- 
dure. Instead of using an alignment algorithm to pro- 
duce distances between sequences, we use a combination 
of approximate measures that correlate with evolution- 
ary distance. The best measures between sequences we 
found, in order of increasing accuracy, are: 

• Length Ratio — sequences that are close evolution- 
arily tend to be about the same length, because in- 
sertions and deletions are rare. 

• Longest common subsequence — since close se- 
quences have fewer mutations, this estimator cor- 
relates negatively with distance. 

• Squared difference of uncommon 

hexamer densities— where a hexamer is a string of 
six letters. This estimator works because in closely 
related sequences the probability that short uncom- 
mon substrings are unaltered is high. 

All these heuristic estimators are cheap to compute com- 
pared 'to a full alignment. The initial tree building proce- 
dure builds a binary tree from the bottom up by adding 
each new node into the tree at the place that minimises 
the total squared error of the combined estimators for 
that node, i.e. it finds the height h and the place m 
the current tree to insert a new parent of the current 
sequence so that the following measure is minimised: 






("* ~ *) a 

[Ok ) 7 


( 6 ) 


where Jfe is the measure index, * is an index ra ; n 8 m 8 ov * r 
all sequences in the current tree, is the A*h heuris- 
tic measure between the current sequence and the »th 
sequence, and a k is the standard deviation of the kth 
measure. Because the positioning of the early members 
of the heuristic tree did not have the benefit of the con- 
straining influence of the later nodes, existing nodes are 
re-added, until the tree does not change. This heuristic 
approach builds remarkably accurate initial trees on our 
test data. Part of the reason for this accuracy is that 
the tree is the result of forcing many independent pieces 



of evidence into a consistent tree, so statistical averag- 
ing compensates for the crudeness of the heuristics. The 
initial tree is used to guide which sequences to align, and 
where to put new ancestors in a bottom-up MDL tree 
construction. 

Unfortunately, this bottom-up tree building procedure 
does not construct optimal trees by the MDL criterion. 
The reason is that in bottom-up tree construction there 
is a degree of arbitrariness in how to assign insertions 
and deletions to a new parent. However, once a tree 
is co ns tructed, we use a set of optimisation operations 
to improve it. For example, whenever common dele- 
tions or insertions are detected on neighboring branches, 
we can change the parent to eliminate them. In all 
these local optimisation steps, the criterion is always 
‘does the proposed change lower the MDL?*. This op- 
portunistic local optimisation procedure runs until it 
is unable to find further improvements. The resulting 
tree is not guaranteed to be the globally optimal MDL, 
but on artificial data where we know the ‘correct* an- 
swer, we found that this procedure gets close. Typi- 
cally, our current implementation gets to within 20% of 
the MDL of the true tree, and does a fairly good job of 
placing related sequences together. Preliminary results 
on a real DNA dataset, containing 126 human alu se- 
quences, yields a slightly smaller description-length than 
a multiple- alignment based on a ‘consensus sequence*. 
Our description is still somewhat worse than the four- 
level hierarchy proposed by Smith and Jurka [3 J 9 bee use 
our optimisation operations are still too simplistic. 

Extended Theory 

We made a number of simplifying assumptions in the tree 
building procedure described above in order to produce 
a working realistic program. The theory can be extended 
to remove many of these assumptions, although the effect 
of such changes on the search procedure may not be sim- 
ple. The simplest extension is to allow the probabilities 
of point mutations to depend on such additional factors 
as neighboring letters, position in sequence, whether the 
sequence is a (RNA) coding sequence or not, different 
point mutation probability matrices (e.g. Fig. 2) on dif- 
ferent branches, etc. In MDL approach, the question to 
be answered is whether the additional information re- 
quired to describe these extra probabilities is paid for in 
improved predictive power. This question is answered by 
seeing if the additional probabilities do indeed produce 
lower MDL than without the additional probabilities. 

Extending the tree building to allow for multiple par- 
ents (a graph) is more challenging because the search 
combinatorics are greater. Constructing a MDL that re- 


flects the fact that some proteins evolved by combining 
domains from very different parent proteins is simple in 
principle. The code must state which parts came from 
which parents and how they are ordered, in addition to 
the usual mutation events. A successful graph building 
program must rely on efficient methods for identifying 
potential building blocks (domains). 

Proteins provide a more interesting possibility for 
MDL — predicting secondary or higher structure from se- 
quence information and information from proteins whose 
structure has been determined. The basic idea is that for 
proteins, instead if just hypothesising particular ances- 
tor sequences, these sequences are segmented into typed 
regions, such as a-helix, 0-turn, etc. fVom a MDL per- 
spective, the question is ‘is the information required to 
describe these typed regions paid for by the informa- 
tion required to describe the sequence data given the re- 
gions?*. For example, the statistics of particular amino 
acids at particular locations in a type of 0-turn could 
provide a reduced encoding. If some proteins in a par- 
ticular tree have known structure, then this greatly en- 
hances the certainty of inferred ancestral structure, since 
secondary structure is strongly conserved. 

Summary 

This paper has described the basic theory of MDL ap- 
plied to the problem of evolutionary tree reconstruction 
from sequence data. Also, a particular search method for 
finding MDL trees was outlined. This heuristic search 
procedure finds a tree as close as possible to the ‘true* 
tree, but is unlikely to find the globally optimal tree. Ex- 
periments with the algorithm on test data showed that 
it captures all the broad families in the true tree, and 
has a message length close to the optimum. These ex- 
periments are just the first step in applying MDL to the 
problem of evolutionary reconstruction, and various ex- 
tensions to the theory are readily adapted to this MDL 
framework. 
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